DECENTRALIZED CONTROL OF MARKOVIAN DECISION PROCESSES:

EXISTENCE OF σ-ADMISSIBLE POLICIES


1. INTRODUCTION 

Classical control theory deals with the problem of controlling the performance of a dynamical
system. In their earliest form, control models were completely deterministic in the sense that there
was no probabilistic or random nature to them. The first stochastic control model simply added a
probabilistic disturbance term to the previously deterministic model, yet it did so with tremendous
utility.

A further development in the control of dynamical systems took place with the addition of
more substantive stochastic components into the model. For example, one could consider
controlling an industrial production model by monitoring demand for the product, a probabilistic
process, and controlling the process by a combination of factors (say, by adjusting size of work
force, amount of inventory, etc.). This model is an example of what has come to be called a
Markovian decision process. An elementary explication of this theory can be found in Derman [7].

In each of the previous models, it was assumed that control was overseen by one central
decisionmaker with access to all available information. Many examples exist of cases where (1) there
are multiple decisionmakers and (2) no single decisionmaker has all of the information. It is this
decentralized control problem that is the subject of our study. For added motivation we will end
this section with several distinct examples of decentralized stochastic control models.

(a) Satellite Communications and Control

Consider a network of satellites each with finite buffers onboard to store data. The main
problem to be considered is how to store data until access to a broadcast channel is available and the
buffer can be dumped. Assume data arrives at each node (i.e., satellite) by some known random process.
The goal is to initiate onboard control of this data management system without instantaneously
knowing the state of other nodes in the network, in particular, to avoid two satellites starting to use
the same channel simultaneously (resulting in transmission collision and data loss). The best control
strategy is chosen by using a realistic cost function designed to include such variables as cost due to
data loss, storage cost, etc.

For an interesting explication and partial solution to this problem, see Schoute [23].

(b) Combat Command Structure 

A very different sort of model is that of the control (or management) of a battle by field
officers who cannot maintain perfect communication with their superiors (and certainly the “heat” of
a battle field is not conducive to good information flow).

The question is: can one produce a strategy which minimizes battle losses and maximizes the
chance of reaching an objective, but which functions in an environment of decentralized
decisionmaking and information flow?

(c) Management of Marketing Force 

For an example which has application to the business world, consider a sales force out “in the
field.” They have only bare communication (possibly they agree to call in to a central person twice
daily) yet one wishes to produce a strategy of resource allocation which maximizes profits and
minimizes duplication of effort. The random occurrences are such things as product demand, weather
conditions, etc. Each salesman must make some decisions but without benefit of full information.
Possibly a set of “guidelines” for decisionmaking could produce more overall reward than each
decisionmaker maximizing only his own reward.

2. MOTIVATION FOR CURRENT RESEARCH

In this section, we will lay out the particular model suggesting this research. Certainly one of
the driving forces behind this work is a NASA-sponsored study on Artificial Intelligence and Robotics
(usually referred to as the Sagan Report). One recommendation of that study was that NASA move
toward employing more “state-of-the-art” technology in these areas as well as taking a leading
position in research and development of such technologies. Within that context, the following model
was communicated to us by Mr. R. L. Larsen.

Assume first of all that the hardware and software exist for onboard satellite computer
processing. By this we mean the capacity to perform a wide range of functions such as preprocessing of
telemetry data (with the idea in mind of filtering out bad or useless data), orbit and attitude
determination, and corrective maneuvering. One could foresee some of the scheduling and planning
functions now occurring manually on the Earth being transferred for automatic control to an onboard
network of communicating computers.

The problem then surfaces: how does one control the real-time flow processing throughout
this network of computers? The object is to minimize loss of information (because of buffer
overflow) yet at the same time reduce costs of data transfer and other communication costs. Again,
because of time delay and other constraints, it may not be possible for each satellite control device
to have full information about states and decisions taken at other satellites.

The mathematical model reduces to a network of queues, where computer jobs or input data
(in some standard units) are the objects lining up on the queues. The controlling device is to change
the rate of data input or output by doing one of several “actions.” For example, one could transfer
processing to a neighboring node in the network, one could transfer back to an essentially infinite
queue on the Earth, one could activate reserve computing power (say on a space station), or one
could reprioritize the processing of jobs in a way which allows faster service. Each of those actions
is essentially simply altering the arrival and service parameters of the queues at each node. These
decisions are viewed as being made locally at each node and only with partial information
concerning the entire network. For example, a particular node may only have vital information concerning
itself and, say, its two closest neighboring satellites; however, it must decide how to control its
processor based on that incomplete knowledge. The details of what one means by a controlling
device for such a decentralized system and ways to obtain such devices are the basic problems to be
addressed in this document.

3. MATHEMATICAL PRELIMINARIES

We present in this section a short discussion of Markovian decision models. An excellent
elementary introduction is that due to Derman [7], while more sophisticated treatments will be found
in Hinderer [9] and Schäl [22].

A Markovian decision process is a stochastic process modeling the time evolution of a dynamic
system which is “controlled” by sequences of decisions (actions) periodically or at regular time
intervals. The process may evolve either in a discrete or continuous time frame; however, for reasons of
simplicity and, as we shall later see, with no loss in generality, we will consider only discrete time
processes in this report.

The model is described by a 5-tuple (S, A, D, p, r) with the following interpretations:

(1) S is the set of states of the system. In general, S may be taken to be a standard Borel space;
but for the purpose of this report, we assume S is at most countable.

(2) A is the set of possible actions. Again, there are general assumptions that can be made
concerning A, but we assume A is a set satisfying a decomposition that will be described in the next
paragraphs.

(3) D is a function from “histories” of the process into subsets of actions. By a history at time
t, $h_t$, we mean a member of the set $(S \times A)^{t+1}$. Thus, a history, $h_t$, looks like:

$$h_t = (x_0, a_0, x_1, a_1, \ldots, x_t, a_t)$$

where

$x_0$ = state at time 0

$a_0$ = action taken at time 0

$x_t$ = state at time t

$a_t$ = action taken at time t.

An augmented history simply adds the state at time t + 1 onto the history at time t. We will denote
it by the symbol $(h_t, x_{t+1})$.

The function D associates to each augmented history, $(h_t, x_{t+1})$, the set of actions available to
the decisionmaker when the process has evolved exactly as described in the augmented history.

For the purpose of this study, we will have the following assumption and decomposition:

(a) $D(h_t, x_{t+1})$ is finite for all $(h_t, x_{t+1})$

(b) $A = \bigcup \{ D(h_t, x_{t+1}) \mid (h_t, x_{t+1}) \text{ is an augmented history} \}$

(4) The symbol p represents the (Markov) transition probability. For each $a \in A$, $x, y \in S$, we
interpret the symbol $p_{xy}(a)$ to be the probability of changing from state x to state y under action a.
Of course, the following regularity conditions must hold:

(a) $p_{xy}(a)$ is defined only when it makes sense (i.e., a must belong to $D(h_t, x)$ for some history
$h_t$).

(b) $\sum_{y \in S} p_{xy}(a) = 1$ for all $x \in S$ and $a \in D(h_t, x)$ for some $h_t$.

This transition probability represents the uncontrollable probabilistic aspect of the model.

(5) The last element of the model is the function r, the reward function. Formally r is a
function from state/action pairs into the real numbers. $r(x, a)$ is interpreted to mean the reward (negative
reward is “cost”) associated with choosing action a when in state x. Of course, r is defined for all
$x \in S$ and $a \in D(h_t, x)$ for some $h_t$.
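
To make the five components concrete, here is a minimal Python sketch of a finite model with two states and two actions. The names (`MDP`, `available`, `queue_example`), the numerical values, and the simplification that D depends only on the current state rather than on the whole augmented history are illustrative assumptions and are not taken from the report.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int
Action = str


@dataclass
class MDP:
    states: List[State]                                 # S
    actions: List[Action]                               # A
    available: Callable[[State], List[Action]]          # D (assumed to depend on the state only)
    p: Dict[Tuple[State, Action], Dict[State, float]]   # p[(x, a)][y] = p_xy(a)
    r: Dict[Tuple[State, Action], float]                # r[(x, a)] = one-step reward

    def check(self) -> None:
        """Verify regularity condition (b): each row of p sums to one."""
        for x in self.states:
            for a in self.available(x):
                total = sum(self.p[(x, a)].values())
                assert abs(total - 1.0) < 1e-12, (x, a, total)


def queue_example() -> MDP:
    # Hypothetical two-state buffer: 0 = empty, 1 = full.
    # "hold" keeps data onboard; "dump" transmits it at a cost.
    p = {
        (0, "hold"): {0: 0.5, 1: 0.5},
        (1, "hold"): {0: 0.0, 1: 1.0},
        (1, "dump"): {0: 0.75, 1: 0.25},
    }
    r = {(0, "hold"): 0.0, (1, "hold"): -2.0, (1, "dump"): -1.0}

    def available(x: State) -> List[Action]:
        return ["hold"] if x == 0 else ["hold", "dump"]

    return MDP([0, 1], ["hold", "dump"], available, p, r)


model = queue_example()
model.check()
```

In this sketch p and r are defined only on pairs (x, a) with a in D(x), which mirrors regularity condition (a), and A is the union of the sets D(x), which mirrors decomposition (b).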

The basic problem associated with this model is to “control” the time evolution in such a way
as to “maximize” reward. There are still two key concepts yet undefined.

Let $H_t$ be the set of all histories up to time t. By a policy, $\pi$, we mean a sequence
$(\pi_0, \pi_1, \pi_2, \ldots)$ where $\pi_t : H_t \times S \to \mathcal{P}(A)$ and $\mathcal{P}(A)$ is the set of
probability distributions on the action space.

$\pi_t(\cdot \mid h_t, x)$ is the probability distribution on A when the system has experienced an augmented
history $(h_t, x)$. Since we are assuming $D(h_t, x)$ is a finite set, $\pi_t(\cdot \mid h_t, x)$ can be interpreted
as a discrete probability density on $D(h_t, x)$. Let the symbol $\Delta$ denote the set of all such policies.





If for each t the probability is concentrated on one action, i.e., $\pi_t(a \mid h_t, x) = 1$ for some
action $a \in D(h_t, x)$, then we call $\pi$ a deterministic policy.

It is the policy $\pi$ that in fact “controls” the time evolution of the process. It is desirable to find
deterministic policies because otherwise the decisionmaker is faced with the unpleasant option of
having to perform some random procedure (say, flip a theoretical coin) to decide which action to
take when in some state.

As a means of picking an optimal policy, we will use the reward function defined above. Further,
we want to pick a policy that is good throughout the evolution of the process (or at least to some
large finite time frame). It is here that there is significant divergence in the analysis. For the purpose
of this study, we will restrict ourselves to “discounted” reward functions, but we acknowledge that
other reward functions are of interest and deserve similar investigation.

Let $\alpha$ be a fixed discount factor, $0 < \alpha < 1$. Let $\pi$ be a fixed policy. The theory of stochastic
processes ensures that to each policy and to each initial distribution on the state at time 0, there is a
stochastic process generated. We denote the stochastic process by $(X_t, A_t)$ where $X_t$ is the state at
time t and $A_t$ is the action taken at time t. We will use the notation $P_\pi$ and $E_\pi$ for the probability
and expectation operators under the policy $\pi$.

Define

$$V_\alpha(\pi, x) = E_\pi\left[ \sum_{t=0}^{\infty} \alpha^t r(X_t, A_t) \,\middle|\, X_0 = x \right].$$

This is the discounted reward function given the process starts in state x. Since we will be
assuming $\alpha$ is constant throughout, that symbol will be suppressed in the notation.

Next define $V(x) = \sup\{ V(\pi, x) \mid \pi \in \Delta \}$. If there exists a policy $\pi^*$ such that for all
$x \in S$, $V(\pi^*, x) = V(x)$, then we say $\pi^*$ is an optimal policy.

The basic results in this area are (1) theorems providing (under appropriate assumptions)
existence of optimal (possibly deterministic) policies and (2) theorems outlining methods of
computing those policies. We refer the reader to Derman [7] for a documented account of such results;
however, one could not proceed without at least mentioning the “principle of optimality” of Bellman
because of its motivational value. One can usually prove (in centralized control models) that the
function V(x) satisfies the following equation:

$$V(x) = \sup_{a} \left\{ r(x, a) + \alpha \sum_{y \in S} p_{xy}(a) V(y) \right\}$$

The interpretation is important: the equation says that the expected discounted reward starting
from state x is the same as the sum of the one-step cost of being in state x under the “best action,”
a, plus the weighted and discounted cost of starting from another state, y, one time unit later.

This equation is useful in deriving algorithms for computing optimal strategies. 
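
As an illustration of how the functional equation leads to an algorithm, the following is a minimal Python sketch of successive approximation (value iteration) on a finite model. The two-state buffer model, the names `backup` and `value_iteration`, and the stopping tolerance are assumptions made for the example; because the available action sets are finite, the supremum is computed as a maximum.

```python
def backup(x, a, v, p, r, alpha):
    """One-step lookahead: r(x, a) + alpha * sum_y p_xy(a) * v(y)."""
    return r[(x, a)] + alpha * sum(prob * v[y] for y, prob in p[(x, a)].items())


def value_iteration(states, available, p, r, alpha, tol=1e-10, max_iter=100_000):
    """Repeat the Bellman update until successive iterates differ by less than tol."""
    v = {x: 0.0 for x in states}
    for _ in range(max_iter):
        new_v = {x: max(backup(x, a, v, p, r, alpha) for a in available[x]) for x in states}
        gap = max(abs(new_v[x] - v[x]) for x in states)
        v = new_v
        if gap < tol:
            break
    # A deterministic policy that is greedy with respect to the computed values.
    policy = {x: max(available[x], key=lambda a: backup(x, a, v, p, r, alpha)) for x in states}
    return v, policy


# Hypothetical two-state buffer: 0 = empty, 1 = full.
states = [0, 1]
available = {0: ["hold"], 1: ["hold", "dump"]}
p = {
    (0, "hold"): {0: 0.5, 1: 0.5},
    (1, "hold"): {0: 0.0, 1: 1.0},
    (1, "dump"): {0: 0.75, 1: 0.25},
}
r = {(0, "hold"): 0.0, (1, "hold"): -2.0, (1, "dump"): -1.0}

v, policy = value_iteration(states, available, p, r, alpha=0.9)
print(v, policy)
```

In this toy model the greedy policy dumps the buffer whenever it is full, since holding a full buffer incurs the storage cost at every later step.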

4. DECENTRALIZED MARKOVIAN DECISION MODELS

In this section we will build upon the structure discussed in Section 3 to apply it to a
decentralized control situation. There is no claim of full generality in the model to be described; in fact, we
have limited the scope initially for ease of analysis.

Consider a network with N nodes, and assume that the state of the system at each node can be
described by a non-negative integer. A useful example to have in mind is that at each node there is
a queue where an integer will describe the number of people in line or in service at that queue. The
state of the entire network is thus given by a vector $(n_1, n_2, \ldots, n_N)$ where $n_i$ is the state of node i.
Let $A_i$ be the set of actions available to a controller at node i. We can fit this situation into the
context of the previous section by defining:

$$S = S_1 \times \cdots \times S_N \quad \text{and} \quad A = A_1 \times \cdots \times A_N$$

where $S_i$ is the set of states available at node i.

The key ingredient that must be added now is that of an information structure. Several authors
have dealt with this problem already. With no pretense of being complete we mention two sources
(in very different disciplines) dealing with this issue. See, for example, Marschak and Radner [19]
or Witsenhausen [30] for more details of work in this area.

In this paper an information structure, $\sigma$, will be a finite sequence of projection functions

$$\sigma = (\sigma_1, \ldots, \sigma_N)$$

where each $\sigma_i$ maps the set $S_1 \times \cdots \times S_N$ onto some subset of the $S_j$'s, say,
$S_{i_1} \times \cdots \times S_{i_k}$.

For example, suppose that $\sigma_1(x_1, x_2, \ldots, x_N) = (x_1, x_3, x_N)$. We interpret the function $\sigma_1$
as saying that controller 1 has full information about nodes 1, 3 and N.

To simplify notation we introduce the use of $\sigma_i$ as a superscript. Its use is meant to simply apply
the appropriate projection operation whenever it is needed. It is best to illustrate by example.

If $\sigma_1$ is as above, by $(x_1, \ldots, x_N)^{\sigma_1}$ we mean the vector $(x_1, x_3, x_N)$. Similarly $\sigma_1$ can be
“applied” to actions as follows:

$$(a_1, \ldots, a_N)^{\sigma_1} = (a_1, a_3, a_N).$$

In fact, we want to also use the notation freely with such complicated objects as histories. If $h_t$ is a
history in the decentralized model, then $h_t$ looks like:

$$h_t = \big( (x_1^0, \ldots, x_N^0), (a_1^0, \ldots, a_N^0), (x_1^1, \ldots, x_N^1), (a_1^1, \ldots, a_N^1), \ldots \big)$$

where $x_i^t$ and $a_i^t$ are the state and action at time t at node i.

By $(h_t)^{\sigma_1}$ we mean: $\big( (x_1^0, x_3^0, x_N^0), (a_1^0, a_3^0, a_N^0), \ldots \big)$. In other words, the
superscript $\sigma_i$ indicates that all information available to controller i is extracted from the “whole
system” state or action vectors.
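
The superscript notation can be sketched in a few lines of Python. The function names `project` and `project_history`, the choice N = 5, and the sample vectors are hypothetical; the only thing taken from the text is that the projection for controller 1 keeps the components at nodes 1, 3 and N.

```python
from typing import Sequence, Tuple


def project(vector: Sequence, nodes: Sequence[int]) -> Tuple:
    """Apply a projection sigma_i: keep the components at the given 1-based node indices."""
    return tuple(vector[j - 1] for j in nodes)


def project_history(history: Sequence[Sequence], nodes: Sequence[int]) -> Tuple:
    """Project every state vector and action vector in a history."""
    return tuple(project(v, nodes) for v in history)


N = 5
sigma_1 = (1, 3, N)                       # controller 1 sees nodes 1, 3 and N

state = (4, 0, 2, 7, 1)                   # (x_1, ..., x_N)
action = ("a", "b", "c", "d", "e")        # (a_1, ..., a_N)
history = (state, action, (3, 1, 2, 6, 0), ("a", "a", "c", "d", "b"))

print(project(state, sigma_1))            # (4, 2, 1)
print(project(action, sigma_1))           # ('a', 'c', 'e')
print(project_history(history, sigma_1))
```

Representing each projection by the tuple of node indices it keeps is one convenient encoding; any function on the joint state space that discards coordinates would serve equally well.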

Our current goal is to define what we mean by a policy which is compatible with the
information structure. It is obvious that the N decisionmakers can only “go on” the data available to them.
This notation helps express this idea. We add that this model does not really handle time delays in
information which were noted in Section 2 as being important. A notation more useful to those
features is that of Witsenhausen [30], but we will forego those complications in the context of this
document.

To complete the Markov decision model, we assume the existence of a transition probability
p, a decision structure D and a one-step cost function r as described in Section 3.

By a σ-admissible policy, $\pi$, we mean a policy such that whenever $(h_t, x)^{\sigma_i} = (h_t', x')^{\sigma_i}$ we have

$$P_\pi(A^{\sigma_i} = a^{\sigma_i} \mid h_t, x) = P_\pi(A^{\sigma_i} = a^{\sigma_i} \mid h_t', x').$$

The above definition simply quantifies what we mean by decentralized control: i.e., whenever
a controller, i, has a certain information configuration, he will always act in the same way even if the
history or state at other nodes is different. The set of all such σ-admissible policies is denoted $\Delta^\sigma$.
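
The following Python sketch checks a simplified version of this condition. It assumes, for illustration only, a deterministic policy that depends on the current joint state alone (no history), and it reads "act in the same way" as: the action chosen by controller i is identical on any two joint states that look the same through his projection. The two-node network, the action names, and the function names are hypothetical.

```python
from itertools import product


def project(vector, nodes):
    """Keep the components at the given 1-based node indices."""
    return tuple(vector[j - 1] for j in nodes)


def is_admissible(policy, sigma):
    """policy: dict mapping joint state -> joint action.
    sigma: list of node-index tuples, one per controller.
    Returns True when each controller's action is a function of his own information."""
    for i, nodes in enumerate(sigma):
        seen = {}
        for x, a in policy.items():
            info = project(x, nodes)
            # Remember the first action seen for this information; later ones must agree.
            if seen.setdefault(info, a[i]) != a[i]:
                return False
    return True


states = list(product((0, 1), repeat=2))   # two nodes, each empty (0) or busy (1)
sigma = [(1,), (1, 2)]                     # controller 1 sees node 1; controller 2 sees both

# Each controller reacts only to what he can observe.
local = {x: ("serve" if x[0] else "idle", "serve" if x[1] else "idle") for x in states}

# Controller 1 reacts to node 2, which he cannot observe.
peeking = {x: ("serve" if x[1] else "idle", "idle") for x in states}

print(is_admissible(local, sigma))     # True
print(is_admissible(peeking, sigma))   # False
```

A randomized policy would require comparing the conditional distributions of each controller's action rather than single actions, which this sketch does not attempt.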

The mathematical problems are similar to those in the standard control problem. For example,
does there exist a σ-admissible optimal policy? If so, is it deterministic? Does the discounted reward
function satisfy a functional equation similar to the “Principle of Optimality”? Finally, are there
reasonable and implementable algorithms that can be used to produce the optimal policies?

In Section 5 we will answer the first question, and in Section 6 we will suggest some approaches
that could be useful in answering the remaining questions.


5. EXISTENCE OF σ-ADMISSIBLE POLICIES

We begin this section by defining the topology on the space of policies $\Delta$. We say $\pi^{(n)}$, a sequence
of policies in $\Delta$, converges to $\pi$ if and only if for each $t = 0, 1, 2, \ldots$; $x \in S$; $h_t \in H_t$; and
$a \in D(h_t, x)$,

$$\lim_{n \to \infty} \pi_t^{(n)}(a \mid h_t, x) = \pi_t(a \mid h_t, x).$$

This is exactly the standard product topology in the compact space

$$\prod_{\substack{(h_t, x) \\ t = 0, 1, \ldots}} [0, 1]^{D(h_t, x)}.$$

In fact, $\Delta$ is simply a closed subset of the above compact space, i.e., those elements such that for
each t, $(h_t, x)$, the function $\pi_t(\cdot \mid h_t, x) : D(h_t, x) \to [0, 1]$ satisfies

$$\sum_{a \in D(h_t, x)} \pi_t(a \mid h_t, x) = 1.$$

Thus $\Delta$ is a compact set.

It is well known, see Derman [5], that under the assumptions of this report, $V_\pi(x) = V(\pi, x)$ is a
continuous function of $\pi$ on the space $\Delta$. Thus, since $\Delta$ is compact, $V_\pi(x)$ attains extreme values
in the set $\Delta$.

We use this fact in the following:

Theorem 1. There exist σ-admissible policies.

Proof. First we will clarify the meaning of σ-admissible optimality. $\pi^*$ is such a policy if for
all $x \in S$

$$V_{\pi^*}(x) = \sup\{ V_\pi(x) \mid \pi \in \Delta^\sigma \}.$$

We will be able to apply the standard argument for continuous functions on compact sets if we can
show that $\Delta^\sigma$ is itself compact. In fact, we need only show that $\Delta^\sigma$ is a closed subset of $\Delta$.

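To that end let $\{\pi^{(n)}\}$ be a sequence of policies in $\Delta^\sigma$ and assume $\pi^{(n)} \to \pi$ in $\Delta$. We want
to show that $\pi \in \Delta^\sigma$. Let t be a fixed time parameter and assume that for $i \in \{1, \ldots, N\}$

$$(h_t, x)^{\sigma_i} = (h_t', x')^{\sigma_i}.$$

Since $\pi^{(n)} \in \Delta^\sigma$ for each n, we have

$$P_{\pi^{(n)}}(A^{\sigma_i} = a^{\sigma_i} \mid h_t, x) = P_{\pi^{(n)}}(A^{\sigma_i} = a^{\sigma_i} \mid h_t', x').$$

Since $\pi^{(n)} \to \pi$, then

$$P_{\pi^{(n)}}(A^{\sigma_i} = a^{\sigma_i} \mid h_t, x) \to P_\pi(A^{\sigma_i} = a^{\sigma_i} \mid h_t, x)$$

and

$$P_{\pi^{(n)}}(A^{\sigma_i} = a^{\sigma_i} \mid h_t', x') \to P_\pi(A^{\sigma_i} = a^{\sigma_i} \mid h_t', x').$$

Thus it follows that

$$P_\pi(A^{\sigma_i} = a^{\sigma_i} \mid h_t, x) = P_\pi(A^{\sigma_i} = a^{\sigma_i} \mid h_t', x').$$

Therefore, $\Delta^\sigma$ is closed. The conclusion of the theorem follows using exactly the same argument as
that used by Derman [5].

6. OPEN QUESTIONS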

The previous section settles the question of whether it is worth looking for σ-admissible optimal
solutions to the decentralized control problem. There are several important and unanswered questions
in this area.

First, it is important to know whether there are circumstances under which there always exist
deterministic σ-admissible optimal policies. A deterministic policy is given simply by a sequence of
functions, $\pi = (f_1, f_2, \ldots)$ where each $f_t$ maps states into actions (i.e., the action which has
probability 1 of occurring). In addition, the defining condition means that whenever
$(h_t, x)^{\sigma_i} = (h_t', x')^{\sigma_i}$ then $[f_t(x)]_i = [f_t(x')]_i$. It is our view at this point in time that
some stringent conditions have to be imposed in order to force the existence of deterministic policies.
Likely candidates for such conditions are (1) either some hierarchical structure where information is
known by “superiors” in the network about all nodes “under” them, or (2) an ordering on actions which
would allow the use of monotone policy analysis. See, for example, Serfozo [26] for methods in the
latter direction.

The second major area of work is with finding implementable algorithms to produce the
optimal policies. Historically, with the centralized control problem, the discounted reward function, V,
satisfied a functional equation which allowed the use of dynamic programming techniques. Thus far
we have been unable to obtain such results.

A more hopeful approach may be one similar to that outlined by Ho [10] in a recent paper. The
idea is to pick or guess a starting optimal policy. One uses the general policy, leaving out one node
at a time, say node i, and obtains a “revised” solution at node i (with other nodes fixed). After all
nodes have a revised solution, one compares the reward function under the revised solution to the
original. The algorithm concludes when no better reward function is obtained.
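
The node-at-a-time idea can be sketched in Python on a toy two-satellite model. Everything specific here is an assumption made for illustration: the local dynamics, the cost figures, the collision penalty, the restriction to stationary deterministic policies in which each controller sees only his own buffer, and the scoring by average discounted reward over initial states. The sketch also accepts a revision at a node as soon as it raises the reward, a slight variation on revising every node before comparing, and it is not claimed to reproduce the method of Ho [10] or to find a globally optimal policy.

```python
from itertools import product

ALPHA = 0.9
LOCAL_STATES = (0, 1)                 # 0 = buffer empty, 1 = buffer holds data
ACTIONS = ("hold", "dump")
JOINT_STATES = list(product(LOCAL_STATES, repeat=2))


def local_step(x, a):
    """Hypothetical local dynamics: distribution over the next local state."""
    if x == 1 and a == "dump":
        return {0: 0.75, 1: 0.25}     # buffer usually empties
    if x == 1:
        return {1: 1.0}               # data stays onboard
    return {0: 0.5, 1: 0.5}           # new data arrives with probability 1/2


def reward(x, a):
    """Hypothetical joint reward: storage cost, transmission cost, collision penalty."""
    dumps = [xi == 1 and ai == "dump" for xi, ai in zip(x, a)]
    cost = 2.0 * sum(x) + 1.0 * sum(dumps)
    if all(dumps):
        cost += 10.0                  # both satellites transmit at once
    return -cost


def evaluate(local_policies, sweeps=500):
    """Discounted reward of the joint policy, averaged over initial joint states."""
    v = {x: 0.0 for x in JOINT_STATES}
    for _ in range(sweeps):
        new_v = {}
        for x in JOINT_STATES:
            a = tuple(g[xi] for g, xi in zip(local_policies, x))
            d0, d1 = local_step(x[0], a[0]), local_step(x[1], a[1])
            future = sum(p0 * p1 * v[(y0, y1)]
                         for y0, p0 in d0.items() for y1, p1 in d1.items())
            new_v[x] = reward(x, a) + ALPHA * future
        v = new_v
    return sum(v.values()) / len(v)


def improve(local_policies):
    """Revise one node at a time with the others fixed; stop when no node improves."""
    policies = list(local_policies)
    best = evaluate(policies)
    improved = True
    while improved:
        improved = False
        for i in range(len(policies)):
            for choice in product(ACTIONS, repeat=len(LOCAL_STATES)):
                candidate = list(policies)
                candidate[i] = dict(zip(LOCAL_STATES, choice))
                score = evaluate(candidate)
                if score > best + 1e-9:
                    policies, best, improved = candidate, score, True
    return policies, best


start = [{0: "hold", 1: "hold"}, {0: "hold", 1: "hold"}]
final, value = improve(start)
print(final, value)
```

Because each local policy is a function of the node's own state only, every candidate examined is admissible for this information structure by construction. The loop terminates since the reward strictly increases over a finite set of policies, but the result is only guaranteed to be unimprovable by a change at a single node.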

Of course there are no theorems yet obtained in this regard, but we feel it is a method worth
investigating.

7. REEXAMINATION OF MOTIVATING EXAMPLE

We close this report with a few comments concerning the relationship between the problem
outlined in Section 2 and that discussed subsequently. The first point to make is that the control problem
outlined in Section 2 is clearly a continuous time model. Certainly arrivals to the various nodes in the
network can happen at any instant in time. However, the mathematics described is for a discrete time
problem.

The point is that the continuous time problem has the discrete problem imbedded inside (at the
jump points in the process); and, more importantly, a solution of one of the problems (i.e.,
optimizing the reward function) will induce a solution to the other. For an excellent explication of this
phenomenon see Serfozo [25].

A second point to be made is that the mathematics described in this document does not (on the
surface) embrace the problem of time-delayed information structures. In fact, with a notation more
adapted to that situation, say the notation in Witsenhausen [30], one can include the time-delayed
case also. It is expected that subsequent work will follow that direction.